A meta-evaluation of the quality of reporting and execution in ecological meta-analyses

Quantitatively summarizing results from a collection of primary studies with meta-analysis can help answer ecological questions and identify knowledge gaps. The accuracy of the answers depends on the quality of the meta-analysis. We reviewed the literature assessing the quality of ecological meta-analyses to evaluate current practices and highlight areas that need improvement. From each of the 18 review papers that evaluated the quality of meta-analyses, we calculated the percentage of meta-analyses that met criteria related to specific steps taken in the meta-analysis process (i.e., execution) and the clarity with which those steps were articulated (i.e., reporting). We also re-evaluated all the meta-analyses available from Pappalardo et al. [1] to extract new information on ten additional criteria and to assess how the meta-analyses recognized and addressed non-independence. In general, we observed better performance for criteria related to reporting than for criteria related to execution; however, there was a wide variation among criteria and meta-analyses. Meta-analyses had low compliance with regard to correcting for phylogenetic non-independence, exploring temporal trends in effect sizes, and conducting a multifactorial analysis of moderators (i.e., explanatory variables). In addition, although most meta-analyses included multiple effect sizes per study, only 66% acknowledged some type of non-independence. The types of non-independence reported were most often related to the design of the original experiment (e.g., the use of a shared control) than to other sources (e.g., phylogeny). We suggest that providing specific training and encouraging authors to follow the PRISMA EcoEvo checklist recently developed by O’Dea et al. [2] can improve the quality of ecological meta-analyses.


Abstract

Quantitatively summarizing results from a collection of primary studies with meta-analysis can help answer ecological questions and identify knowledge gaps. The accuracy of the answers depends on the quality of the meta-analysis. We reviewed the literature assessing the quality of ecological meta-analyses to evaluate current practices and highlight areas that need improvement. From each of the 18 review papers that evaluated the quality of meta-analyses, we calculated the percentage of meta-analyses that met criteria related to specific steps taken in the meta-analysis process (i.e., execution) and the clarity with which those steps were articulated (i.e., reporting). We also re-evaluated all the meta-analyses available from Pappalardo et al. [1] to extract new information on ten additional criteria and to assess how the meta-analyses recognized and addressed non-independence. In general, we observed better performance for criteria related to reporting than for criteria related to execution; however, there was a wide variation among criteria and meta-analyses. Meta-analyses had low compliance with regard to correcting for phylogenetic non-independence, exploring temporal trends in effect sizes, and conducting a multifactorial analysis of moderators (i.e., explanatory variables). In addition, although most meta-analyses included multiple effect sizes per study, only 66% acknowledged some type of non-independence. The types of non-independence reported were more often related to the design of the original experiment (e.g., the use of a shared control) than to other sources (e.g., phylogeny). We suggest that providing specific training and encouraging authors to follow the PRISMA EcoEvo checklist recently developed by O'Dea et al. [2] can improve the quality of ecological meta-analyses.

Introduction
Meta-analyses evaluate summary statistics from primary studies to obtain aggregate effects, assess the heterogeneity of those effects, and ascertain possible causes of the observed heterogeneity. For example, meta-analysis has been used to quantify the strength of density-dependence [1], to assess the response of ecosystems to climate change [2], and to evaluate the performance of different management strategies [3]. Through synthesis, meta-analysis not only advances basic ecological theory, but also facilitates the application of ecological data to inform environmental policy [4].
Moreover, meta-analysis can help identify knowledge gaps, and thus direct new research endeavors [5]. Given these benefits, the number of published meta-analyses is rapidly increasing [6,7], driven by increased data availability and by pressing ecological questions that require synthetic research.
Despite their importance and wide application, meta-analyses vary in the quality of their execution, and the reporting of meta-analytic methods is highly variable [5,8-10]. If the quality of meta-analyses is poor, it is hard to know whether "biological meta-analysis embodies 'mega-enlightenment', a 'mega-mistake', or something in between" [11]. One of the main issues that can prevent readers from evaluating the overall quality of a published meta-analysis is a lack of detail describing each step of the meta-analysis (usually referred to as reporting quality). Poor reporting quality hinders readers from assessing whether the meta-analysis was executed properly and whether the results are reliable. New methodological guidelines specifically designed for ecology and evolutionary biology [PRISMA-EcoEvo, 12] provide authors, reviewers, and editors with a checklist of items aimed at improving the overall quality of ecological meta-analyses. Wide adoption of these guidelines could greatly improve the quality of meta-analyses in ecology and evolutionary biology. Assessing the current status of compliance with recommended best practices in ecological meta-analyses, and identifying areas that need improvement, is thus a necessary first step to guide the meta-analytic community towards more robust inference.
In this paper, we reviewed the literature assessing the quality of ecological meta-analyses and collected new data to evaluate current practices and highlight the areas that need more work.
First, we compiled information from 18 studies published in the last 20 years (between 2002 and 2022) that reviewed the quality of meta-analyses in ecology, evolution, and related fields. These papers provide insights into compliance with different standards of reporting quality and with the recommended steps that should be part of a meta-analysis. Second, we evaluated the recognition and treatment of non-independence in the ecological meta-analyses included in Pappalardo et al. [10], and extracted new data on quality criteria to compare with the other reviews. Finally, we summarized the level of compliance for different quality criteria across the synthetic reviews and the new data.

Methods
To evaluate current practices in conducting and reporting ecological meta-analyses, we surveyed the literature for quantitative assessments of criteria previously identified as best practices in meta-analysis. These criteria fall into two broad categories: 1) execution (i.e., methodological issues), and 2) reporting. Both categories of criteria aim to ensure appropriate and reproducible results. Our list of criteria was informed by Koricheva & Gurevitch [5, Table 3] and the PRISMA EcoEvo checklist [12]. To find relevant papers, we first performed an exploratory search in Google Scholar, using combinations of keywords including "meta-analysis", "review", "quality", "ecology", and "evolution". We then searched the Core Collection of the ISI Web of Science database, including articles and reviews within the "Ecology", "Evolutionary Biology", "Biodiversity Conservation", and "Plant Sciences" categories (last search update on Sep 16, 2022).
We used a search string for TOPIC as: (["meta-analyses" OR "metaanalyses" OR "meta …

Fig 1: PRISMA diagram detailing the relevant literature sources identified, screened, excluded, assessed, and selected for the final analysis. Of the 19 final papers we compiled information from, two reviewed quality from a similar pool of papers (Romanelli et al. [25], Romanelli et al. [26]); because of this, we excluded Romanelli et al. [26] from the final comparison.
We compiled information from 19 papers (Table 1) that passed the inclusion criterion of having quantitative data on the quality of reporting or execution of meta-analyses for any of the criteria previously identified as targets. A list of the criteria, and which paper contributed to each of them, is provided in Table 2. We noticed that two of the reviews [25,26] compiled information from a similar pool of studies in restoration ecology, so we only extracted information from 18 papers, excluding Romanelli et al. [26] from the final comparison. Because different review papers used slightly different criteria in their reviews, we matched similar criteria; details on which information was extracted for each paper and criterion are provided in Appendix A. We obtained the proportion of meta-analyses that complied with a particular criterion (data available in the supplementary data file "compilation-of-previous-review-papers"). For the final analysis, we included only those criteria for which we could gather information from at least two review papers.

Table 1. Compilation of papers that reviewed the quality of reporting in ecological meta-analyses.
The "Review ID" (first author initials plus publication year) was used to identify review papers in Fig 1, Fig  2, and tables and figures in the Supplementary Material. Lodi et al., 2021 reviewed the quality of metaanalysis in two topic areas, we identified them with an additional code (lodi2021_fe and lodi2021_ee). There are two Romanelli et al., 2021 publications, we separated them as roma2021a and roma2021b, and extracted data from roma2021a because it had information for more of our target quality criteria and both publications reviewed the same pool of meta-analyses.

Table 1 notes: For one of the reviews (2022), we counted the number of meta-analyses provided in the "retained meta-analyses" table of its supplementary data file. Another review (2013) screened a first round of 133 papers, from which they quantified the percentage that did not report using weights; then, from the 83 papers that passed their filter as "sound" meta-analyses, they quantified the percentage of papers that quantified and explored heterogeneity.
If there is overlap in the meta-analyses reviewed by these papers, comparisons between review papers may not be independent; however, because each review used a different set of search algorithms and often targeted a specific topic, such overlap is likely small. We quantified the overlap between meta-analysis reviews for all cases in which the full list of references was available (in the main text or in the supplementary material), or when the authors replied to our requests for this information. We used the first author's last name, journal, and year as the identification string to measure overlap in the number of publications shared between review papers.
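As an illustration, the overlap count can be computed with a few lines of R; the reference lists, object names, and column names below are hypothetical and shown only as a sketch of the identification-string approach described above.

```r
# Sketch: counting reference overlap between two review papers, assuming each
# reference list is a data frame with first-author, journal, and year columns
# (hypothetical example data).
refsA <- data.frame(first_author = c("Smith", "Lee", "Diaz"),
                    journal      = c("Ecology", "Oikos", "Ecology Letters"),
                    year         = c(2010, 2012, 2015))
refsB <- data.frame(first_author = c("Lee", "Diaz", "Wang"),
                    journal      = c("Oikos", "Ecology Letters", "Ecology"),
                    year         = c(2012, 2015, 2018))

# Build an identification string from first author, journal, and year
id_string <- function(refs) {
  tolower(paste(refs$first_author, refs$journal, refs$year, sep = "_"))
}

# Number of meta-analyses shared between the two reviews
length(intersect(id_string(refsA), id_string(refsB)))  # returns 2 for these data
```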
In addition, because we had access to the full set of meta-analyses reviewed by Pappalardo et al. [10], we expanded on their results by adding additional criteria evaluated here (Table 2).
Pappalardo et al. [10] analyzed 96 meta-analyses related to global change (PRISMA diagram available in their Fig S1). For the criteria related to Reporting, we collected new information on: inclusion/exclusion criteria, the number of papers and the number of effect size estimates, the types of non-independence, identification of the statistical package used, identification of the specific functions used for the analyses, and provision of the code used (when applicable). For the criteria related to Execution, we evaluated whether the publication explored temporal changes in effects, conducted sensitivity analyses, controlled for phylogenetic non-independence, and tested for publication bias. More details on the calculations for each criterion are provided in Appendix A.
When a publication acknowledged non-independence (e.g., described some type of non-independence), we also recorded the source of the non-independence being acknowledged, whether the authors attempted to account for it, and the methods used to address it. The source of non-independence was coded as "sample" when it referred to non-independence among observed effect sizes arising from non-independent within-study errors (e.g., multiple measurements of the same individual, or a shared control or treatment), and as "study" when the non-independence arose from study-level correlation (e.g., multiple effect sizes from the same publication, which could generate random paper effects).
To code whether a publication addressed non-independence, we used: "yes", when the publication described one or more sources of non-independence and addressed at least one; "partially", when the publication mentioned more than one source of non-independence but did not address all of them; and "no", when the publication did not address non-independence. We coded the methods used to deal with non-independence as: 1) "average", when the non-independent values were averaged [e.g., 29: averaged repeated measurements across study durations; 30: averaged across species]; 2) "choose", when the authors chose one value from multiple non-independent values [e.g., 31: used the last sampling point; 32: used one response variable per study]; 3) "model", when the authors accounted for non-independence within the meta-analytic model [e.g., 33: included paper ID as a random effect; 34: included a variance-covariance matrix obtained from phylogenetic distances]; and 4) "tested", when non-independence was assessed, found not to be demonstrable, and subsequently ignored [e.g., 35-37]. If a test was done and non-independence was supported, the paper was coded according to the method used to address non-independence, and not as "tested" [e.g., 38].
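To make the "average" and "model" categories concrete, here is a minimal sketch in R using the metafor package [63]; the data frame, its columns, and the effect sizes are hypothetical, and the simple averaging shown is a crude illustration (it ignores the covariance among the averaged estimates).

```r
library(metafor)

# Hypothetical data: several effect sizes (yi) with sampling variances (vi)
# nested within source papers (paper)
set.seed(1)
dat <- data.frame(paper = rep(1:10, each = 3),
                  yi    = rnorm(30, 0.3, 0.4),
                  vi    = runif(30, 0.01, 0.05))
dat$obs <- seq_len(nrow(dat))   # unique identifier for each effect size

# "average": collapse non-independent effect sizes to one value per paper
# (crude; a proper aggregation would account for their covariance)
dat_avg <- aggregate(cbind(yi, vi) ~ paper, data = dat, FUN = mean)
res_avg <- rma(yi, vi, data = dat_avg)

# "model": keep all effect sizes and account for the dependence with a
# random effect for paper (a multilevel meta-analytic model)
res_mod <- rma.mv(yi, vi, random = ~ 1 | paper/obs, data = dat)
```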
We compiled the percentage of compliance for each criterion from the review papers and the new data, separating the criteria into Reporting and Execution types. We classified performance for each criterion as "high" when the percentage of papers complying with a criterion was >75%, "moderate" when compliance was >50% but <75%, "low" when compliance was >25% but <50%, and "very low" when compliance was <25%. The data collected are provided as supplementary data files. We analyzed and visualized data using the R software [39] and the packages scales [40], flextable [41], pander [42], kableExtra [43], readxl [44], ggcharts [45], and tidyverse [46]. The code used to compile information, re-analyze data, and combine the final dataset used for analysis is available in the supplementary material.
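This classification can be expressed as a simple binning rule; the values below are hypothetical, and the handling of the boundary values (25%, 50%, 75% falling into the higher category) follows the figure captions and is an assumption.

```r
# Sketch: classifying percent compliance into the four performance levels
compliance <- c(10, 30, 60, 90)   # hypothetical compliance percentages
cut(compliance,
    breaks = c(-Inf, 25, 50, 75, Inf),
    labels = c("very low", "low", "moderate", "high"),
    right  = FALSE)   # 10 -> "very low", 30 -> "low", 60 -> "moderate", 90 -> "high"
```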
Table 2. List of Reporting and Execution criteria compiled from reviews of meta-analyses.
Publications are identified using the Review ID provided in Table 1, and marked "yes" if they provided information for a specific criterion. For Pappalardo et al. [10], we indicated "yes" when we re-analyzed their published dataset, and "added" when we re-reviewed the meta-analyses to compile new data. Blank entries represent cases where data were not available for a specific criterion (likely because it was not the focus of the authors' review).

Overlap between review papers
The overlap between review papers was generally low. For Reporting criteria, the median number of shared papers was 2 and the mean was 3; for Execution criteria, the median number of shared papers was 2 and the mean was 4.7. In the Supplementary Material (section "Overlap between review papers") we include the overlap matrices for each quality criterion, which show the number of papers shared between each pair of review papers, and the distribution of the number of papers shared (Figs S2, S3).

Compliance with Reporting and Execution criteria
In our compilation of synthesis reviews, we found wide variability in the compliance within and between the different quality criteria. We did not observe any clear differences among different subdisciplines (Figs. S4, S5), nor did we observe any temporal trends in compliance (Figs. S6, S7).
In general, we observed better compliance with criteria related to Reporting (Fig 2) than to Execution (Fig 3). Across reviews, we observed high to moderate compliance with Reporting criteria such as: providing the list of references (Fig 2F), specifying the meta-analytic model (Fig 2D), detailing inclusion/exclusion criteria (Fig 2C), and identifying the packages (Fig 2J) and software (Fig 2K) used. On the other hand, Reporting criteria such as including full details on the literature search (Fig 2B), providing the data used to calculate effect sizes (Fig 2E), and providing the analytic code (Fig 2G) and functions (Fig 2H) used exhibited very low to moderate compliance.

Fig 2: The percent of papers complying with each Reporting criterion is plotted for each synthesis paper. The colors indicate the overall performance for each criterion, coded as: "high" (percentage compliance ≥ 75%), "moderate" (50% ≤ percentage compliance < 75%), "low" (25% ≤ percentage compliance < 50%), and "very low" (percentage compliance < 25%). The Review ID corresponds to the papers listed in Table 1.
For the Execution criteria, there was lower compliance with criteria such as conducting a sensitivity analysis (Fig 3A), controlling for phylogenetic non-independence (Fig 3B), exploring temporal changes in effect sizes (Fig 3D), conducting a multifactorial analysis of moderators (vs. multiple single-factor analyses) (Fig 3E), and testing for publication bias (Fig 3G). In contrast, most papers explored the possible causes of heterogeneity (Fig 3C). For the Execution criteria of weighting effect sizes by study precision (Fig 3H) and quantifying heterogeneity in effect sizes (Fig 3F), compliance was highly variable.

Fig 3: The percent of papers complying with each Execution criterion is plotted for each synthesis paper. The colors indicate the overall performance for each criterion, coded as: "high" (percentage compliance ≥ 75%), "moderate" (50% ≤ percentage compliance < 75%), "low" (25% ≤ percentage compliance < 50%), and "very low" (percentage compliance < 25%). The Review ID corresponds to the papers listed in Table 1. In panel (A), Roberts [24] evaluated sensitivity analysis and found that 0% of papers reported it.

Non-independence
In our review of the meta-analyses compiled by Pappalardo et al. [10], we found that in all meta-analyses but one, the number of effect sizes was larger than the number of papers (Fig 4). This difference was often of several orders of magnitude (Fig 4; note the log scale on both axes). This suggests the possibility of non-independence, as effect sizes derived from the same source paper are more likely to be similar than those coming from different papers. Sixty-six percent of the meta-analyses acknowledged some type of non-independence (Fig 5A). The source of non-independence acknowledged most often (68% of the time) was related to the design of the original experiment (e.g., a common control used for different treatments) and how the data were collected (Fig 5B).
Acknowledging non-independence from other sources of correlation (e.g., multiple experiments per publication) was less common (36% of the time; Fig 5B). The majority of papers (95%) that acknowledged some source of non-independence also addressed it (Fig 5C). The most common ways in which non-independence was addressed (Fig 5D) were: choose (50%), average (27%), model (14%), and tested (11%), with some papers using a combination of these approaches.

Fig 4: Relationship between the number of papers and the number of effect sizes included in ecological meta-analyses.
This plot was made by re-analyzing the compilation of meta-analyses by Pappalardo et al. [10]. Note that the axes are in log scale to accommodate two studies with extreme numbers of effect sizes. One is a study reporting 52 papers that did not specify the number of effect sizes and provided a dataset with 46,347 rows [47]. The other is a study that analyzed 1,785 papers and reported a total of 32,567 effects for one of their meta-analyses [48]. The dashed line indicates the 1:1 line (x = y).

Fig 5: Percent of papers that acknowledged non-independence, addressed it, and the methods they used to deal with non-independence.
(A) Acknowledged at least one type of non-independence in their data ("yes") or did not acknowledge non-independence ("no"). (B) For the papers that did acknowledge non-independence, the sources of non-independence, classified as "study" or "sample". (C) The percent of papers that addressed all of the types of non-independence mentioned ("yes"), a subset of the factors mentioned ("partially"), or did not address non-independence ("no"). (D) For the papers that did address non-independence, the methods used to address it, classified as: "choose", when the authors chose one value from multiple non-independent values; "average", when the non-independent values were averaged; "model", when the authors accounted for non-independence within the meta-analytic model; and "tested", when the authors tested for the effects of non-independence and decided they were not significant. Note that papers that used more than one method (or source) were counted in each category, so the percentages among the levels of panels B and D do not add up to exactly 100%.

Compliance with Reporting criteria
Even though there was overall good compliance with Reporting criteria (e.g., providing the list of primary papers included in the meta-analysis), many issues remain widespread. Meta-analysis papers were less consistent in their reporting of information that can critically affect the results of a meta-analysis [e.g., the inclusion/exclusion criteria; 49]. Even minimal information such as the number of papers and effect sizes was not always included; for example, the review by Cadotte [6] showed that fewer than 50% of the meta-analyses reported this basic information. Many of these meta-analyses are not reproducible, because relatively few provided the data used to conduct their analyses (e.g., effect sizes, variances, moderators) or specified the model used to analyze the data (e.g., a random-effects model). Similarly, among the meta-analyses that reported using a programming language, very few reported the specific functions used for data analysis or provided their code, both of which are essential for reproducibility. Low code-sharing is not exclusive to meta-analysis; even for research articles published in ecological journals that encourage or mandate code-sharing, only 27% provide all or some of the code used for the analysis [50]. Making the data available provides great benefits for the research community, because the data can be useful for meta-research or for integrative research that combines them in some novel way [51].
To encourage code and data sharing, journals can develop incentives for authors to publish their datasets or code (e.g., a discount on open access fees for authors sharing data and code). Most data repositories will provide a separate DOI for the dataset that can be properly cited. Most importantly, as reporting practices improve and data become available, reproducibility will improve. Achieving at least computational reproducibility will ensure results are robust and credible. This is particularly important for researchers working in applied science and conservation [52].
Having commonly accepted guidelines for meta-analysis could be a useful tool to improve the quality of meta-analyses, although empirical research on this topic often gives mixed results. Even before the PRISMA guidelines were initially developed [in 2009 by 53], systematic reviews in the medical field showed higher reporting quality than literature reviews in ecological meta-analysis [8,24]. This was likely because early guidelines for systematic reviews in the medical field used a standard set of methods developed by the Cochrane Collaboration [54]. How much the PRISMA guidelines improved the quality of reporting in medical meta-analyses is not clear.
Some papers report a moderate increase in reporting quality after the publication of the PRISMA guidelines [55], while others report no change [56, which only reviewed abstracts]. For medical meta-analyses, the papers that compared self-reporting with journal endorsement of PRISMA found higher quality of reporting only when the journals endorsed and implemented the PRISMA guidelines [57,58]. In their synthesis review of meta-analyses in ecology and evolution, O'Dea et al. [12] showed that meta-analyses that reported having followed some guidelines tended to have higher quality ratings. Lodi et al. [20] found higher quality ratings for meta-analyses in freshwater ecology from more recent years, and suggested that previous papers on reporting guidelines are the reason for the improvement. We think that the recently published PRISMA EcoEvo guidelines [12] could have a bigger impact on improving the quality of reporting if journals required those guidelines for the submission of meta-analyses. Some journals, such as PLOS ONE, already have a structure in place to detect whether certain key aspects of a self-reported meta-analysis are present (e.g., a PRISMA plot). By pooling the references from all the review papers we analyzed, we can highlight the five journals that publish the highest number of meta-analyses and that would benefit the most from implementing the PRISMA EcoEvo guidelines in their submission process: 1) Ecology Letters (n=

Compliance with Execution criteria
The overall low compliance with Execution criteria suggests that most meta-analyses do not follow recommended methods. Below we discuss some of the topics usually associated with the recommended steps for a meta-analysis.
Weighting. One of the advantages of meta-analysis is that effect sizes are conventionally weighted by the precision of the observed effect size. Our compilation of reviews showed that the percentage of papers that weighted the effect sizes varied widely (from 33% to 93%). By re-analyzing the meta-analyses from Pappalardo et al. [10], we found that only 42% weighted by the inverse of the variance, 6% weighted by sample size, 16% used some non-traditional weight, and 36% did not weight effect sizes in any way to account for variation in their precision or quality. Several papers used unweighted analyses because of incomplete reporting in the primary publications (e.g., standard deviations or sample sizes were not reported) and because the authors did not want to greatly reduce the number of studies by conducting a weighted meta-analysis.
New imputation techniques to estimate standard deviations could be a good alternative that allows conducting a weighted meta-analysis without discarding data [59]. However, although weighting is generally recommended for meta-analysis because it is expected to yield more precise estimates (and is one of the criteria in PRISMA-EcoEvo), unweighted analyses could provide results as reliable as those obtained from weighted analyses, particularly when among-study variance is large relative to within-study variances (Song et al., pers. comm.). Thus, weighting is not necessarily a requirement of a well-executed meta-analysis. Conducting a sensitivity analysis on a smaller dataset that allows comparison of unweighted and weighted meta-analyses can be a way to check whether results are consistent [60].
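As a sketch of such a sensitivity check, weighted and unweighted fits can be compared with the metafor package [63]; the data below are simulated for illustration and are not from any of the reviewed meta-analyses.

```r
library(metafor)

# Hypothetical effect sizes (yi) and sampling variances (vi)
set.seed(42)
dat <- data.frame(yi = rnorm(20, 0.25, 0.3),
                  vi = runif(20, 0.005, 0.2))

# Default random-effects model: effect sizes weighted by inverse variance
res_w  <- rma(yi, vi, data = dat)

# Unweighted analysis of the same data (equal weight for every effect size)
res_uw <- rma(yi, vi, data = dat, weighted = FALSE)

# Compare the pooled estimates and their confidence intervals
coef(summary(res_w))
coef(summary(res_uw))
```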

Heterogeneity.
A central purpose of ecological meta-analysis is to quantify heterogeneity and explore its causes. A fixed-effects model, which assumes no heterogeneity among effect sizes, has been discouraged for ecological meta-analysis [61], and its use seems to be declining [27]. In fact, we found (for most reviews) high compliance in exploring the causes of heterogeneity using explanatory variables (i.e., moderators), sometimes evaluated graphically. However, even when heterogeneity is addressed using a random-effects or mixed-effects model, metrics quantifying heterogeneity (e.g., the Q or I² statistics) are not always reported. Most reviews reported very low to moderate compliance with quantifying heterogeneity.
A common and prominent issue in exploring heterogeneity is that multiple explanatory variables are evaluated individually in separate models. This approach can be invalid because the explanatory variables may not be independent, or because failure to account for a factor simultaneously may give rise to spurious results (e.g., via Simpson's paradox). A multifactorial analysis of moderators, which would address this issue, was reported in only a few meta-analyses. Gates [8] also noted that most meta-analyses did not correct for multiple testing when conducting subgroup analyses. This could reflect limitations imposed by software; for example, MetaWin, a commonly used program, does not allow multifactorial analyses. Additionally, Nakagawa & Santos [21] noted that meta-analytic data are often sparse, and including all moderators in the model may result in a great loss of sample size, making such analyses problematic [see 62]. This can be one of the reasons why multivariate analyses are less common [21]. However, given the increase in data and meta-data availability, and the availability of software with detailed documentation for running a multifactorial analysis of moderators [e.g., the R package metafor, 63], multifactorial analyses should become more feasible.
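For illustration, here is a minimal sketch of single-factor versus multifactorial moderator analyses in metafor [63]; the data and the moderators (habitat, duration) are hypothetical.

```r
library(metafor)

# Hypothetical data: effect sizes with two moderators that may be correlated
set.seed(7)
n <- 40
dat <- data.frame(yi       = rnorm(n, 0.2, 0.35),
                  vi       = runif(n, 0.01, 0.1),
                  habitat  = sample(c("terrestrial", "aquatic"), n, replace = TRUE),
                  duration = runif(n, 1, 10))   # study duration in years

# Single-factor analyses (common, but potentially misleading if moderators covary)
res_hab <- rma(yi, vi, mods = ~ habitat,  data = dat)
res_dur <- rma(yi, vi, mods = ~ duration, data = dat)

# Multifactorial analysis: both moderators in one mixed-effects meta-regression
res_both <- rma(yi, vi, mods = ~ habitat + duration, data = dat)
summary(res_both)   # reports QM (omnibus moderator test) and QE (residual heterogeneity)
```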
Sensitivity analysis. Sensitivity analyses are additional analyses conducted to test the robustness of a meta-analysis to methodological choices. They can involve exploring how results change when influential points are removed, when different types of information or weighting schemes are included, or when different types of effect sizes are calculated. Sensitivity analysis can also be used to address issues related to non-independence, or at least to explore its consequences [64]. Across reviews, we observed that only a low percentage of meta-analyses reported having conducted a sensitivity analysis. This conclusion is likely conservative because we were not strict in our criteria and did not require that authors refer to these analyses explicitly as a "sensitivity analysis". Although analyses of publication bias can be considered a type of sensitivity analysis, we followed previous reviews [e.g., 5] and quantified them as separate criteria. It is possible that some researchers ran additional explorations that could be considered a sensitivity analysis in the earlier stages of a publication, but these were not stated explicitly in their final manuscripts or supplementary information. We encourage authors of meta-analyses to include the results of sensitivity analyses, to showcase the different types of limitations related to their dataset, and to quantify whether different methodological choices affect their conclusions. This will enhance the robustness of their meta-analyses.
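One simple form of sensitivity analysis, leave-one-out re-fitting together with influence diagnostics, is sketched below with metafor [63]; the data are simulated for illustration only.

```r
library(metafor)

# Hypothetical random-effects meta-analysis
set.seed(3)
dat <- data.frame(yi = rnorm(15, 0.4, 0.3),
                  vi = runif(15, 0.01, 0.08))
res <- rma(yi, vi, data = dat)

# Leave-one-out sensitivity analysis: refit the model omitting one effect size
# at a time and inspect how the pooled estimate and heterogeneity change
leave1out(res)

# Influence diagnostics flag potentially influential effect sizes
inf <- influence(res)
plot(inf)
```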

Publication bias.
A general problem in scientific research is that significant results are more likely to be published. For a meta-analysis, this means the available literature may over-represent significant effect sizes, which in turn may bias the conclusions of the meta-analysis. Meta-analysis has tools to identify the existence of publication bias (e.g., the funnel plot) and to assess its impact (e.g., the fail-safe number); these methods have pros and cons, and it is recommended to present results for at least two of them [11,21]. Despite the availability of specific methods, compliance with assessing and reporting publication bias was low (<50%) in seven of the eight reviews that quantified this criterion.
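As an illustration of presenting more than one such check, here is a sketch combining a funnel plot, a regression test for funnel-plot asymmetry, and a fail-safe number in metafor [63]; the data are simulated and shown only as an example.

```r
library(metafor)

# Hypothetical random-effects meta-analysis
set.seed(11)
dat <- data.frame(yi = rnorm(30, 0.3, 0.3),
                  vi = runif(30, 0.005, 0.09))
res <- rma(yi, vi, data = dat)

# Funnel plot: visual check for small-study effects / publication bias
funnel(res)

# Regression test for funnel-plot asymmetry (Egger-type test)
regtest(res)

# Rosenberg's fail-safe number: how many null results would be needed
# to render the pooled effect non-significant
fsn(yi, vi, data = dat, type = "Rosenberg")
```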
A different type of publication bias in meta-analyses in ecology and evolution is a temporal trend in effect sizes. For example, a decrease in the magnitude of effect sizes over time has been observed in various areas of ecology and evolution [65,66], although the existence of such trends is generally debatable [67]. Across reviews, the percentage of meta-analyses that addressed temporal trends in effect sizes was very low (ranging from 1% to 5%). Koricheva et al. [66] describe a wide range of methods available to explore and analyze temporal trends.
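Two of the simpler options, a meta-regression on publication year and a cumulative meta-analysis in chronological order, are sketched below with metafor [63]; the data and the year variable are hypothetical.

```r
library(metafor)

# Hypothetical data: effect sizes with the publication year of each primary study
set.seed(5)
dat <- data.frame(yi   = rnorm(25, 0.3, 0.3),
                  vi   = runif(25, 0.01, 0.08),
                  year = sample(1995:2020, 25, replace = TRUE))

# Meta-regression of effect size on publication year (simple test for a temporal trend)
rma(yi, vi, mods = ~ year, data = dat)

# Cumulative meta-analysis: sort studies chronologically, then add them one by one
dat <- dat[order(dat$year), ]
res <- rma(yi, vi, data = dat)
cumul(res)   # pooled estimate as each successive (later) study is added
```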
Methods for detecting and quantifying the effects of publication bias, such as regression- or correlation-based approaches for analyzing the asymmetry of funnel plots, may encounter challenges in ecology and evolution due to heterogeneity and non-independence, two characteristics commonly associated with data in these fields [22]. To address this issue, Nakagawa et al. [22] proposed using what they refer to as "conditional residuals" from hierarchical models, instead of the observed effect sizes, when analyzing funnel plots. This approach accounts for heterogeneity and non-independence by subtracting from the observed effect sizes the fixed and random effects that model the heterogeneity and non-independence. We recommend that readers consult Nakagawa et al. [22] for details of these methods.

Non-independence. Non-independence is common in biological data; if not addressed properly, it can produce spurious results [69]. In particular, failing to account for non-independence often leads to wrong estimates of standard errors and thus to invalid statistical inference. Non-independence may occur at the sampling level (i.e., non-independent within-study errors) or at the study level (i.e., non-independent random effects). The former may arise from the use of a shared control or from repeated measurements of the same individuals over time, while the latter may arise from phylogeny or from a similar environment shared by a set of studies. Our review of the meta-analyses compiled by Pappalardo et al. [10] showed a higher percentage of papers (66%) acknowledging some source of non-independence than reported by Archmiller et al. [15] and O'Dea et al. [12] (44% and 14%, respectively). When looking at the sources of non-independence, the majority of the non-independence acknowledged was at the sampling level. For example, authors often recognized non-independence caused by using a shared control or repeated measurements when calculating effect sizes. The most common approaches to addressing sampling-level non-independence were choosing a subset of the data (36%) or averaging the non-independent measurements (50%). We did not observe any meta-analyses that tried to explicitly model the covariance associated with sampling-level non-independence, even though formulas for these covariances have been derived for many scenarios of non-independence [70,71]. In contrast, non-independence arising from study-level correlation is much less often recognized and addressed in meta-analyses. Only 14% of the meta-analyses from Pappalardo et al. [10] attempted to address study-level non-independence, mostly by using a hierarchical model. However, study-level non-independence may arise in multiple ways in ecological meta-analyses [64]. In fact, we found that the number of effects far exceeded the number of papers in the meta-analyses we examined. Given that studies from the same source paper are more likely to share similar environments or methodology, it is very likely that study-level non-independence is common. Thus, the relatively low proportion of published meta-analyses addressing this source of non-independence is a cause for concern.
A particularly common case of study-level non-independence arises from phylogeny. Closely related species may have similar traits that could be associated with similar responses; thus, data from different species may not be independent. Paired analyses of the same datasets using traditional and phylogenetic meta-analysis have shown that the magnitude of the effect size, the uncertainty around it (e.g., the width of its confidence interval), and its significance can change [17]. For example, Chamberlain et al. [17] report that 40% of meta-analyses using random effects showed a difference in the significance of the effect size, with the majority of effects changing from significant to non-significant when a phylogenetic meta-analysis was used. The influence of phylogenetic relatedness on the outcome of meta-analysis has also been studied using simulations. Cinar et al. [72] found that, under moderately strong phylogenetic relatedness, failing to account for species-level variance generates biased estimates of overall means and poor coverage. This is troubling given that all the review papers showed very low compliance with controlling for phylogenetic non-independence.
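As a sketch of how study-level and phylogenetic non-independence can be modeled jointly, the example below fits a multilevel model in metafor [63] with a phylogenetic correlation matrix built with the ape package; the data, the tree, and all variable names are simulated for illustration only, not taken from any of the reviewed meta-analyses.

```r
library(metafor)
library(ape)

# Hypothetical data: multiple effect sizes per paper and per species
set.seed(9)
n_sp <- 8
tree <- rcoal(n_sp, tip.label = paste0("sp", 1:n_sp))   # random ultrametric tree
dat <- data.frame(paper   = rep(1:12, length.out = 40),
                  species = sample(rep(tree$tip.label, length.out = 40)),
                  yi      = rnorm(40, 0.2, 0.3),
                  vi      = runif(40, 0.01, 0.08))
dat$obs <- seq_len(nrow(dat))   # unique identifier for each effect size

# Phylogenetic correlation matrix derived from the tree
phylo_cor <- vcv(tree, corr = TRUE)

# Multilevel model: random effects for paper, individual effect size, and species,
# with the species effects correlated according to the phylogeny
res <- rma.mv(yi, vi,
              random = list(~ 1 | paper/obs, ~ 1 | species),
              R      = list(species = phylo_cor),
              data   = dat)
summary(res)
```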

How can we implement best practices?
As has been highlighted in multiple sessions at the 2020 Ecological Society of America meeting, there is a need for data integration at multiple scales, data synthesis, and training of young investigators in computer programming and the use of appropriate statistical tools. To address this gap in the long term, we think it is important to train ecologists in meta-analysis techniques.
This could involve including meta-analysis topics in the curriculum of Ecology/Evolution graduate programs, which could be done as part of courses addressing statistical methods and data analysis or be required for prospectus exams. Alternatively, training could be offered as short workshops. Our compilation highlights that the recommended Execution steps for a biological meta-analysis are the ones that need more attention. Now that specific guidelines are available with a focus on meta-analysis in ecology and evolution, authors can follow the PRISMA EcoEvo checklist [12] as a guide to plan their meta-analyses, and reviewers and editors can use it to assess the quality of reporting in a meta-analysis. More importantly, improving reporting quality and following guidelines will also improve the quality of the research. Although we showed better overall performance in following Reporting criteria, compliance was highly variable across reviews, which suggests ample room for improvement. Based on the results of Pappalardo et al. [10], we also think that the type of confidence interval used to measure the uncertainty around a mean effect size should be added to this checklist. Making the PRISMA EcoEvo checklist a mandatory reading step during submission for papers self-reported as meta-analyses could help to 1) make explicit whether a paper is a statistical meta-analysis, and 2) encourage good reporting, reproducibility, and overall quality. Key components of future meta-analyses and synthesis studies are data sharing and good data management practices [78].
The learning curve to conduct a meta-analysis and follow all the detailed steps may be steep and discouraging. The training opportunities mentioned above could help flatten the learning curve for implementing the recommended steps. Some researchers argue that we should not let the perfect be the enemy of the good [79]. We think that clear reporting of the steps taken to conduct the analysis, together with data availability, can at least ensure follow-up analyses. Including more of the recommended data-analysis steps will make the inferences more robust, which is the ultimate goal of a statistical tool.

Supporting Information captions
Appendix A. Details on the information extracted from each paper for each performance criterion.
For data from Pappalardo et al. [10], we indicated when the data were collected in this study by re-reviewing their compilation of ecological meta-analyses (tagged "added"). When the number of publications complying (or not complying) with one of the criteria was reported, we used that information to calculate the percentage of papers complying; in other cases the reviews directly reported the information as a percentage. For the few papers for which we had the original review data for each criterion (e.g., [15]), we summed the number of papers complying with each criterion, and then calculated the percentage of compliance based on the total number of papers relevant for that criterion.
Fig S2. Distribution of paper overlap for Reporting criteria.
Distribution of the number of papers shared between reviews for all the Reporting criteria combined.

Fig S3. Distribution of paper overlap for Execution criteria.
Distribution of the number of papers shared between reviews for all the Execution criteria combined.

Table S1. Number of meta-analyses per journal included in the reviews of meta-analyses. Because the distribution is strongly right-skewed (with most journals publishing only a few meta-analyses), we display only the journals with at least 5 meta-analyses.

Fig S4. The percent of papers complying with each criterion is plotted for each synthesis paper. The colors indicate the different subdisciplines of the review papers. The Review ID corresponds to the papers listed in Table 1 of the main manuscript.

Fig S5. The percent of papers complying with each criterion is plotted for each synthesis paper. The colors indicate the different subdisciplines of the review papers. The Review ID corresponds to the papers listed in Table 1 of the main manuscript.